1 History and definitions

1.1 What is phylogenetics and when did it start?

Phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms

Phylogenetics goes back to the origins of evolutionary theory

> Darwin drew this tree

There are many, many applications of phylogenetics.

There are a lot of learning materials online. Here’s an intro course from EBI: https://www.ebi.ac.uk/training/online/course/introduction-phylogenetics

1.2 What are phylogenetic trees?

A phylogenetic tree is a tree structure (with nodes and edges) in which tips correspond to organisms and internal nodes correspond to inferred common ancestors

The root is usually an inferred oldest node, or a node separating your study organisms from an “outgroup”. A tree may not have a root.

library(ape)            # provides the phylo class and tree plotting
load("demotree.Rdata")  # loads the demo tree object 'tr'
plot(tr, edge.width = 2, edge.col = "grey",
     tip.col = c(rep("grey", 3), rep("blue", 3), rep("grey", 6)))

If we are studying the blue organisms we might want to root the tree this way (but we don’t have to – it depends on the question):

tr = root(phy = tr, outgroup = c("Organism 4", "Organism 5", "Organism 6"))
plot(tr, edge.width = 2, edge.col = "grey",
     tip.col = c(rep("grey", 3), rep("blue", 3), rep("grey", 6)),
     main = "The same tree, with a different root")

The horizontal axis – branch lengths – is in units of genetic distance, typically something like the number of substitutions per site in molecular (sequence) data. Sometimes we create a timed phylogenetic tree; its branch lengths are in units of time.

The vertical axis – nothing! The vertical ordering of the tips carries no information.

In the above tree, Organism 11 is as closely related to Organism 8 as Organism 10 is to Organism 9.

1.3 Terminology

Sister groups: two groups descending from the same ancestor

Clade: an ancestor and all its descendants. Such a group is also called a monophyletic group.

Common ancestor (of a set of tips): a node that is an ancestor of all tips in the set

Most recent common ancestor (MRCA) of a set of tips: the common ancestor of the set of tips that is farthest from the root

1.4 Quiz time!

This tree shows groups of animals and also indicates inferred evolutionary events that happened on some of the branches of the tree:

  • Q1: Do the reptiles form a clade?
  • Q2: Do the mammals form a clade?
  • Q3: What is the closest relative of the fish?
  • Q4: According to the groupings on the tree, what feature separates mammals from other organisms?
  • Q5: What feature separates sponges from other organisms?
  • Q6: Are snakes closer to mollusks or to amphibians in this tree?



2 Lots of trees

So: is it practical to find the truly best tree among all the possible trees? (A: not really, since exhaustive search is infeasible beyond a handful of taxa, but heuristic searches work quite well in practice. There is usually some uncertainty!)
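Why is exhaustive search hopeless? The number of distinct unrooted binary trees on \(n\) labelled tips is the double factorial \((2n-5)!! = 1 \times 3 \times 5 \times \cdots \times (2n-5)\), which explodes quickly. A quick sketch in Python (the notes use R elsewhere; this is just arithmetic):

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary tree topologies on n labelled tips:
    (2n-5)!! = 1 * 3 * 5 * ... * (2n-5), valid for n >= 3."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
# 10 tips already give 2,027,025 possible trees; 20 tips give ~2.2e20.
```

This is why tree-building methods search heuristically rather than scoring every tree.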

3 Phylogenetic analysis

3.1 Stages

Phylogenetic analysis usually proceeds in stages like the ones in this cartoon.

We’ll start from step 6 in this lecture. Not that the other steps aren’t important, but steps 1-3 are specific to your research question, and steps 4 and 5 belong to other areas of bioinformatics.

3.2 What data do we need for phylogenetics?

These days, now that sequencing technologies are so affordable, molecular (sequence) data are usually the key data for phylogenetic tree reconstruction.

Why use molecular data?

  1. DNA is inherited material
  2. We can now easily, quickly, inexpensively and reliably sequence genetic material
  3. Sequences are highly specific and rich in information

But we could use morphological or phenotype data

In fact we can also test how consistent trees from genetic and phenotype data are with each other

The characters are the entries of the aligned data. In molecular data, each character is a genetic site or locus. For example, suppose the multiple sequence alignment looks like this:

head demo.fas
>A130
atgaaaccct cgcgccctta tctttaccac gcgtactaca actggatttt agttaatgat
>A131
atgaaaccct tgcgccctta tcttttccac gcgtcctaca actggatttt agataatgat
>A132
atgaaaccct cgcgccctta tctttaccac gcgtactaca actcgatttt agataatgat
>A133
atgaaaccct tgcgccctta tctttgccac gcgtactaca actggatttt agataatgat

Here the organism names are “A130”, “A131”, “A132” and “A133”. There are 60 characters, written in 6 groups of 10. All organisms have an ‘a’ as the first character.
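As a sanity check, the claims above can be verified programmatically. A minimal sketch in Python, with the four sequences pasted in directly (a real pipeline would read demo.fas with a FASTA parser):

```python
# The four aligned sequences from demo.fas, with the display spaces removed.
alignment = {
    "A130": "atgaaaccctcgcgcccttatctttaccacgcgtactacaactggattttagttaatgat",
    "A131": "atgaaacccttgcgcccttatcttttccacgcgtcctacaactggattttagataatgat",
    "A132": "atgaaaccctcgcgcccttatctttaccacgcgtactacaactcgattttagataatgat",
    "A133": "atgaaacccttgcgcccttatctttgccacgcgtactacaactggattttagataatgat",
}

# Every sequence has 60 characters, and every first character is 'a'.
assert all(len(seq) == 60 for seq in alignment.values())
assert all(seq[0] == "a" for seq in alignment.values())
```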

3.3 Distance- and character-based tree building

  1. Distance-based methods

Compute distances between pairs of sequences and find a tree (e.g. using clustering) that best describes these. Example: neighbour joining.

In fact the distances are still based on characters, but the characters aren’t considered one at a time - they are pooled together to create pairwise distances between all the sequences in the dataset.

  2. Character-based methods

Consider sequences of (DNA, RNA, morphological) characters, one at a time, and seek the tree with an optimal “score” (likelihood or parsimony), or use a Bayesian method to compute a posterior collection of phylogenetic trees.

Can we get the best tree?

Either way, we need to understand how to compare sequences to each other! We need to either compute a distance between all the pairs of sequences, or to figure out a likelihood or score for a tree.

To do these comparisons, we use models that describe how sequences evolve.

4 Models of sequence evolution

4.1 What is this and how does it describe distance?

How can we compute the distance between two sequences?

A common approach: the number of differences per site. For example, the two sequences “AACCTATCGG” and “AATCTAACGG” differ in two places, so they would be 2/10 = 0.2 substitutions per site apart.

In other words, if the number of differences is \(m\) and the total number of sites is \(n\) then this basic distance is \(d_s=\frac{m}{n}\).

However: what if there were actually 2 changes at the same site? This simple computation would then underestimate the true distance.

There are many models for sequence evolution. They define the probability that a character at a site will evolve into a different character at that site over a specified time period. They can take various complexities into account. Models of sequence evolution can be used to define distances between two sequences, and they can be used to simulate evolving sequences.

A, C, T and G: adenine, cytosine, guanine, thymine

A and G: 2-ring structures; purine

C and T: 1-ring structures; pyrimidine

Transition: a purine-purine change, or a pyrimidine-pyrimidine change (C-T or A-G)

Transversion: a pyrimidine-purine change (or vice versa)

The Jukes-Cantor model treats these as having the same probability

But it is reasonable to think that transitions might be more probable than transversions

Also, A, C, T and G might not be all equally frequent.

4.2 Assumptions and parameters in several common models

Models beyond Jukes-Cantor take these differences into account. In the JC model, any possible change is equally likely and all four bases are equally frequent. It is the simplest model of sequence evolution.

Conceptually, the link is this: the JC model (and higher-complexity variations) define the probability that one sequence evolves into another. The “distance” is related to the amount of time that this would take in the model.

The Jukes-Cantor model is the simplest, but these models are all based on homogeneous continuous-time Markov chains (CTMCs). These are beyond the scope of this course. The Wikipedia page on these models is quite good as a quick reference: https://en.wikipedia.org/wiki/Models_of_DNA_evolution. “Markov Processes for Stochastic Modeling” by Masaaki Kijima is a mathematical reference.

If the substitution rate is \(\mu\) substitutions per unit time, then the mean number of substitutions in time \(t\) is \(\mu t\). The CTMC has a rate matrix \(Q\) whose off-diagonal entries are the instantaneous rates of transition \(A -> C\), \(A -> T\), \(C -> G\) and so on, and these are all equal to \(\mu/4\). The transition matrix is \(P(t) = e^{Qt}\).

Consider the probability that a change from state \(i\) to state \(j\) is observed in time \(t\), if the substitution rate is \(\mu\). We have a probability \(e^{-\mu t}\) that no event happens in time \(t\). The rate of leaving each state is \(3\mu/4\) since 3 of the 4 “changes” (A to A, C, T, G) arrive at a different base. So the expected number of changes in time \(t\) is \(v = 3/4 \mu t\). We can write \(\mu t = 4/3 v\).

\[ P(i \rightarrow j) = P( i\rightarrow j | \text{no event}) P(\text{no event}) + P( i\rightarrow j | \text{event}) P(\text{event}) \]

\[ P(i \rightarrow j) = 0 \cdot P(\text{no event}) + \tfrac{1}{4} (1-e^{-\mu t}) = \tfrac{1}{4} - \tfrac{1}{4} e^{-\mu t}\]

> The probability of any change at the site is 3 times this, because there are 3 choices of base that are different from the current one:

\[ P(\text{change}) = \tfrac{3}{4} - \tfrac{3}{4} e^{-\mu t} = \tfrac{3}{4} - \tfrac{3}{4} e^{-4v/3}\]

> We can use this relationship to relate the distance between two sequences (\(v\)) to the empirical fraction of sites that differ between them, \(P(\text{change}) = d_s\), by inverting it.
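The closed-form JC probabilities can be checked against the matrix exponential \(P(t) = e^{Qt}\) directly. A small pure-Python sketch, with a truncated Taylor series standing in for a library matrix-exponential routine:

```python
import math

def jc_prob(same, mu, t):
    """Closed-form JC transition probability over time t:
    1/4 + 3/4*exp(-mu*t) if the states are equal, else 1/4 - 1/4*exp(-mu*t)."""
    e = math.exp(-mu * t)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def expm(Q, t, terms=60):
    """Matrix exponential e^{Qt} via a truncated Taylor series."""
    n = len(Q)
    A = [[Q[i][j] * t for j in range(n)] for i in range(n)]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in P]
    for k in range(1, terms):
        term = [[sum(term[i][m] * A[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

# JC rate matrix: off-diagonal rates mu/4, diagonal -3*mu/4 (rows sum to 0)
mu, t = 1.0, 0.5
Q = [[-3 * mu / 4 if i == j else mu / 4 for j in range(4)] for i in range(4)]
P = expm(Q, t)
assert abs(P[0][0] - jc_prob(True, mu, t)) < 1e-9
assert abs(P[0][1] - jc_prob(False, mu, t)) < 1e-9
assert abs(sum(P[0]) - 1.0) < 1e-9   # rows of a transition matrix sum to 1
```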

The Jukes-Cantor model’s distance is

\[ d_{JC} = -\frac{3}{4}\ln(1-\frac{4}{3}d_s) \]

where \(d_s\) is the same as above (number of differences / sequence length).
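For the example pair of sequences above (“AACCTATCGG” and “AATCTAACGG”), both distances can be computed in a few lines. A Python sketch:

```python
import math

def p_distance(s1, s2):
    """Observed differences per site, d_s = m/n."""
    assert len(s1) == len(s2)
    m = sum(a != b for a, b in zip(s1, s2))
    return m / len(s1)

def jc_distance(s1, s2):
    """Jukes-Cantor corrected distance: d_JC = -3/4 * ln(1 - 4/3 * d_s)."""
    ds = p_distance(s1, s2)
    return -0.75 * math.log(1 - (4.0 / 3.0) * ds)

ds = p_distance("AACCTATCGG", "AATCTAACGG")    # 2 differences / 10 sites = 0.2
djc = jc_distance("AACCTATCGG", "AATCTAACGG")  # ~0.2326, slightly larger than d_s
```

The corrected distance is always a little larger than \(d_s\), because it accounts for unseen multiple hits at the same site; as \(d_s\) approaches 3/4 the correction blows up.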

There is a pdf at this link that does a nice job of explaining how the Jukes-Cantor model is related to sequence distances http://www.montefiore.ulg.ac.be//Users/ccolijnkvansteen/GBIO0009-1/ac20132014/T4/jc.pdf.

JC: ACTG all equally frequent and all changes equally likely

K80: (Kimura 1980) ACTG all equally frequent, but transitions are more likely than transversions

F81 (Felsenstein 1981): as with JC but ACTG are not all equally frequent

HKY85: (Hasegawa, Kishino and Yano 1985) Transitions and transversions are not equally likely, and ACTG are not all equally frequent

In practice, these models are all implemented within software packages that we use to build phylogenetic trees.

5 Building phylogenetic trees: the central task of phylogenetics

5.1 Neighbour joining

Distances are defined using sequence evolution models

Idea: Join closest tips, make new node, repeat!

Details: differing clustering algorithms:

Single linkage, complete linkage, UPGMA

Software: bionj, nj in R

reference: Saitou and Nei 1987

5.1.1 The neighbour joining algorithm:

Intuition: Start with all the tips separate. Create a new node that joins the two that are closest to each other. Then define the distances between these two tips and the new node. Then define the distances between the new node and all the other tips. Forget about those two tips, but keep the new node. Repeat until everything is joined up.

Suppose there are \(n\) tips and the distance between \(i\) and \(j\) is \(D_{ij}\).

To make our description more precise, we need to answer three questions: (1) How do we choose which pair to join up next? (2) When we make a new node, how far is the new node from the two tips it joins? (3) How far is the new node from all the other tips?

Two definitions before we start

\[Q_{ij} = (n-2) D_{ij} - \sum_{k=1}^n D_{ik} - \sum_{k=1}^n D_{jk}\]

If \(E\) is the new node joining say \(A\) and \(B\), let

\[ D_{EA} = \frac{1}{2}D_{AB} + \frac{1}{2(n-2)}\left(\sum_{k=1}^n D_{Ak} -\sum_{k=1}^n D_{Bk} \right)\]

The NJ algorithm

Begin with a collection of sequences, unlinked. Compute the distance matrix \(D_{i,j}\).

While there remain nodes that are not joined up:

  1. Calculate \(Q_{ij}\) for all pairs \((i,j)\), using the current distance matrix (starting from the distance matrix on all the tips)
  2. Find the pair of taxa \((A,B)\) with the lowest \(Q\) value, \(Q_{A,B}\). If there’s more than one such pair, choose one at random. The \(Q\) criterion comes from wanting to choose \(A\) and \(B\) to minimize the total branch length of the tree in the next step. See Saitou and Nei 1987: https://doi.org/10.1093/oxfordjournals.molbev.a040454.
  3. Create a new node, temporarily called \(E\). This node will join \(A\) and \(B\).
  4. Use the formula for \(D_{EA}\) above to compute the distance between \(E\) and \(A\). Do the same for \(D_{EB}\) (note \(D_{EB} = D_{AB} - D_{EA}\)).
  5. Calculate the distance between \(E\) and each other node \(k\) in the tree (not \(A\) or \(B\)): \(D_{Ek} = \tfrac{1}{2}(D_{Ak} + D_{Bk} - D_{AB})\).
  6. Update the distance matrix: remove the columns and rows corresponding to \(A\) and \(B\) and add a row and column for node \(E\). This way, the size of the distance matrix reduces by 1.

Repeat 1 - 6. Eventually there are only two nodes left to join.
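One round of the algorithm can be sketched in Python. The five-taxon distance matrix below is a made-up, perfectly tree-like worked example (not from the notes):

```python
def nj_step(D, names):
    """One neighbour-joining round on the distance matrix D (dict of dicts).
    Returns the joined pair (A, B), their branch lengths to the new node E,
    and the updated (smaller) distance matrix."""
    n = len(names)
    r = {i: sum(D[i][k] for k in names) for i in names}   # row sums
    # Steps 1-2: pick the pair (A, B) minimising Q_ij = (n-2) D_ij - r_i - r_j
    A, B = min(((i, j) for i in names for j in names if i < j),
               key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
    # Steps 3-4: branch lengths from the new node E to A and to B
    dEA = 0.5 * D[A][B] + (r[A] - r[B]) / (2 * (n - 2))
    dEB = D[A][B] - dEA
    # Steps 5-6: distances from E to the remaining nodes; shrink the matrix
    rest = [k for k in names if k not in (A, B)]
    dE = {k: 0.5 * (D[A][k] + D[B][k] - D[A][B]) for k in rest}
    E = f"({A},{B})"
    newD = {i: {j: D[i][j] for j in rest} for i in rest}
    for k in rest:
        newD[k][E] = dE[k]
    newD[E] = dict(dE, **{E: 0.0})
    return (A, B), (dEA, dEB), newD

# A made-up, tree-like distance matrix on five taxa a..e
names = ["a", "b", "c", "d", "e"]
M = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
D = {x: {y: M[i][j] for j, y in enumerate(names)} for i, x in enumerate(names)}
pair, (dEA, dEB), D2 = nj_step(D, names)
# pair == ("a", "b"); dEA == 2.0, dEB == 3.0; D2["(a,b)"]["c"] == 7.0
```

The next round would call `nj_step` on the shrunken matrix with the new node in place of the joined pair, repeating until only two nodes remain.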

NJ strengths: can use a realistic substitution model; computationally efficient (no need to compare lots of trees against an optimality criterion); great for large datasets and short distances.

Tree-like distances: If the distances are perfect (and reflect the true tree) the tree is guaranteed to be correct.

Weaknesses: If the distances aren’t perfect then the tree may well be wrong. In some circumstances the branch lengths can be negative! The method is sensitive to gaps in the sequences. The model does not allow for different sites to have different “matches” to the tree - they all just contribute to the distance. Distance measures may be inaccurate for large distances.

5.2 Maximum parsimony

Based on character/sequence evolution on tree

Find the tree that minimises the number of changes you need to describe each character

Here, the tree on the right requires 4 changes. The one on the left only needs 2. So the one on the left is more parsimonious.
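Counting the minimum number of changes for one character on a fixed tree is usually done with Fitch’s algorithm, the standard parsimony-scoring method (not described in these notes). A minimal Python sketch for a four-tip tree ((A,B),(C,D)):

```python
def fitch(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.
    tree: a tip name (str) or a (left, right) tuple; states: tip name -> base."""
    def walk(node):
        if isinstance(node, str):          # tip: single-state set, 0 changes so far
            return {states[node]}, 0
        (sL, cL), (sR, cR) = walk(node[0]), walk(node[1])
        if sL & sR:                        # intersection non-empty: no new change
            return sL & sR, cL + cR
        return sL | sR, cL + cR + 1        # disjoint: take the union, add a change

    return walk(tree)[1]

tree = (("A", "B"), ("C", "D"))
assert fitch(tree, {"A": "a", "B": "a", "C": "g", "D": "g"}) == 1
assert fitch(tree, {"A": "a", "B": "g", "C": "a", "D": "g"}) == 2
```

The same character can need a different number of changes on a different topology, which is exactly why one tree can be more parsimonious than another.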

Strengths: Simple to score and understand; Computationally efficient (because it’s simple)

Weaknesses: Lack of explicit assumptions (e.g. hypervariable sites? Sites with many changes?); no branch length or timing information; not statistically consistent

Software: PAUP, MEGA, TNT

5.3 Maximum likelihood

Likelihood: the probability (or likelihood) of the data given the parameters of an underlying model, \(L(D | M)\)

These are computed using the same continuous time Markov chains described above: Jukes-Cantor and so on.

These CTMCs give the probability that (for example) an A evolved from a C on a branch of length \(t\), \(P_{a,c}(t)\).

For large samples: max likelihood estimates of parameters (MLEs) are unbiased, consistent and efficient

Felsenstein pruning algorithm makes it feasible to sum over the unknown states at the internal nodes

For example, the likelihood on the little demo tree for site 1 is the probability that the root (node 1) is in state \(s_1\), that this state evolves to the base at taxon C over branch length \(v_c\), and that \(s_1\) evolves to \(s_2\) over branch length \(v_{12}\). Node 2 then has to evolve from state \(s_2\) to state ‘a’ at site 1, on the branch between node 2 and taxon A. And so on.

Just as \(P_{ij}(t)\) was the probability that \(i \rightarrow j\) in time \(t\) in the JC model, here, define \(P_{s1, s2}(v)\) to be the probability that state \(s1\) transitions to state \(s2\) on a branch of length \(v\). Let \(\pi_{s1}\) be the probability of state \(s1\); in the Jukes Cantor model \(\pi_{s1}=1/4\) for all four states.

Since we don’t know the sequences at the internal nodes ‘node 1’ and ‘node 2’, they could be any of the 4 bases.

\[ L(\text{site 1}) = \sum_{s_1\in \{a,c,t,g\}}\sum_{s_2\in \{a,c,t,g\}} \pi_{s_1} P_{s_1, a}(v_c) P_{s_1, s_2}(v_{12}) P_{s_2,a}(v_a) P_{s_2,a}(v_b) \]

The key observation in Felsenstein’s pruning algorithm is that we can group these sums together:

\[ L(\text{site 1}) = \sum_{s_1\in \{a,c,t,g\}}\pi_{s_1} P_{s_1, a}(v_c)\sum_{s_2\in \{a,c,t,g\}} P_{s_1, s_2}(v_{12}) P_{s_2,a}(v_a) P_{s_2,a}(v_b) \]

We can do this for every single site (in this example, there are 4 sites) to get the tree likelihood:

\[ L(\text{data}|\text{tree}) = \prod_j L(\text{site j}) \]
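The regrouping can be checked numerically. A Python sketch for the three-tip demo tree, using Jukes-Cantor transition probabilities; the tip states (all ‘a’, matching the equations above) and the branch lengths are arbitrary choices for illustration:

```python
import math

BASES = "actg"

def P(i, j, v):
    """JC transition probability over a branch of length v substitutions/site."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_lik_bruteforce(xa, xb, xc, va, vb, v12, vc):
    """Double sum over the unknown internal states s1 (root) and s2 (node 2)."""
    return sum(0.25 * P(s1, xc, vc) * P(s1, s2, v12) * P(s2, xa, va) * P(s2, xb, vb)
               for s1 in BASES for s2 in BASES)

def site_lik_pruned(xa, xb, xc, va, vb, v12, vc):
    """Felsenstein pruning: evaluate the inner sum over s2 once per s1."""
    return sum(0.25 * P(s1, xc, vc) *
               sum(P(s1, s2, v12) * P(s2, xa, va) * P(s2, xb, vb) for s2 in BASES)
               for s1 in BASES)

args = ("a", "a", "a", 0.1, 0.2, 0.05, 0.3)
assert abs(site_lik_bruteforce(*args) - site_lik_pruned(*args)) < 1e-12
```

On this tiny tree the saving is invisible, but on large trees the regrouping turns a sum over all combinations of internal states (exponential in the number of internal nodes) into work that is linear in the number of nodes.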

Statistically, the topology is a model and the rates of substitution and so on are parameters. One of the huge triumphs of maximum likelihood models is the ability to estimate these parameters by computing the above likelihood and maximizing it. This requires sampling over the space of possible trees, which is BIG.

Central assumptions used in defining the likelihood:

  1. Independent evolution at different sites
  2. Independent evolution in different branches

\[L = \prod_{i \in sites} L_i(D_i |M)\]

where \(D_i\) is the data for the characters at site \(i\)

Real datasets may break both assumptions! So be careful

See Felsenstein 1981, Felsenstein book: Inferring phylogenies

Software: RAxML, PhyML, R (phangorn), IQtree, FastTree

Strengths: rich repertoire of sophisticated models. Can estimate parameters of underlying models. This means we can estimate the mutation rate, transition/transversion ratio, etc.

Remember the demo tree of organisms? ML methods allow us to estimate what the sequence (or the phenotype, if we use phenotype data) was like at internal points in the tree.

Weaknesses: primarily computational cost.

5.4 Bayesian methods

Bayes’ theorem: \(P(M|D) = \frac{P(D|M) P(M)}{P(D)}\) where M: model, D: data.

\(P(D|M)\) is the same as the likelihood

\(P(M)\) : prior information

Use MCMC to generate a sample from the posterior

Software: Beast, MrBayes

Strengths: Clear re: uncertainty. The posterior collection of trees includes trees supported by the data, with tree topologies more often included if they are better supported.

Note the very different philosophy to ML. There are many flexible models of evolution (as in ML), and parameters can be estimated (as in ML), but Bayesian approaches also take prior knowledge into account.

Weaknesses

Posterior probabilities can appear too high: why? Sensitive to model violations.

Prior knowledge is hard to obtain; people use defaults in software, but innocent-looking priors can really drive the posterior.

Computationally slow; intractable past a few hundred sequences.

6 Tree quality

Three things to consider:

  1. Consistency: approaches the true value as the amount of data grows. Parsimony may not be consistent; MLEs of parameters usually are
  2. Efficiency: unbiased estimate with the lowest variance. Usually people look at the probability of recovering the correct tree as the number of sites increases
  3. Robustness: correct answer even if assumptions are violated?

You will probably also consider speed! Broadly: ML beats parsimony for molecular data

7 Bootstrapping

Resample columns of your data (so some are left out, others repeated)

Repeat your tree reconstruction

Perform this many times

For each edge in your original tree: how many times does its descending clade occur among your bootstrap trees?
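The resampling step itself is simple. A minimal Python sketch, with the alignment held as equal-length strings (each pseudo-alignment would then be fed back into the tree-building method):

```python
import random

def bootstrap_alignment(seqs, rng=random):
    """Resample alignment columns with replacement: some columns are left out,
    others appear more than once."""
    n_sites = len(next(iter(seqs.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in seqs.items()}

# A tiny made-up alignment for illustration
aln = {"A130": "AACCTATCGG", "A131": "AATCTAACGG", "A132": "AACCTAACGG"}
boot = bootstrap_alignment(aln)
assert all(len(s) == 10 for s in boot.values())
# Every resampled column is an intact column of the original alignment:
originals = {tuple(aln[n][c] for n in aln) for c in range(10)}
assert all(tuple(boot[n][k] for n in aln) in originals for k in range(10))
```

Because whole columns are resampled together, the pattern of which taxa agree at a site is preserved; only the weighting of sites changes between replicates.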

Bootstrap numbers don’t have a clear statistical interpretation. But bootstrapping is very important to understand tree uncertainty.

Not so good if sites aren’t independent. Why? Because you get more similar data across bootstrap replicates than you would if the sites were independent. So nodes can have “high bootstrap values” (i.e. a split in a tree occurs often) not because the tree is a particularly good model but because the bootstrap resampling isn’t doing what it would do if the independence condition were met.

8 Resources

Yang Z, Rannala B. Molecular phylogenetics: principles and practice. Nat Rev Genet. 2012;13: 303–314. doi:10.1038/nrg3186

Felsenstein book: Inferring Phylogenies https://www.amazon.co.uk/Inferring-Phylogenies-Joseph-Felsenstein/dp/0878931775

EBI course: http://www.ebi.ac.uk/training/online/course/introduction-phylogenetics

Neighbour joining revealed (Gascuel and Steel 2006) https://academic.oup.com/mbe/article/23/11/1997/1322446

Many other online resources are freely available